39 research outputs found

    Zipf extensions and their applications for modeling the degree sequences of real networks

    The Zipf distribution, also known as the discrete Pareto distribution, attracts considerable attention because it helps describe skewed data from many natural and man-made systems. Under the Zipf distribution, the frequency of a given value is a power function of its size. Consequently, plotting the frequencies against the sizes on a log-log scale for data following this distribution yields a straight line. For many data sets, however, the linearity is observed only in the tail, and in that case the Zipf is fitted only to values larger than a given threshold. This procedure implies a loss of information and, unless one is interested only in the tail of the distribution, it shows the need for more flexible alternative distributions. The work conducted in this thesis revolves around four bi-parametric extensions of the Zipf distribution. The first two belong to the class of Random Stopped Extreme distributions. The third extension results from applying the Poisson-Stopped-Sum concept to the Zipf distribution, and the last one is obtained by adding a parameter to the probability generating function of the Zipf. An interesting characteristic of three of the models presented is that their parameters can be interpreted in a way that gives some insight into the mechanism that generates the data. To analyze the performance of these models, we have fitted the degree sequences of real networks from different areas, such as social networks, protein interaction networks and collaboration networks. The fits obtained have been compared with those of other bi-parametric models, such as the Zipf-Mandelbrot, the discrete Weibull and the negative binomial. To facilitate the use of the models presented, they have been implemented in the zipfextR package, available in the Comprehensive R Archive Network.
    Estadística i Investigació Operativ
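
The log-log linearity mentioned in the abstract can be checked numerically. The following is a minimal sketch, assuming the usual Zipf probability mass function P(X = x) = x^(-alpha) / zeta(alpha) with alpha > 1. The function names and the partial-sum approximation of zeta(alpha) are illustrative choices and are not taken from the thesis or from the zipfextR package.

```python
import math

def zipf_pmf(x, alpha, terms=10_000):
    """Zipf pmf P(X = x) = x**(-alpha) / zeta(alpha), for alpha > 1.

    zeta(alpha) is approximated by a partial sum (an assumption made
    here to stay within the standard library).
    """
    zeta = sum(k ** -alpha for k in range(1, terms + 1))
    return x ** -alpha / zeta

def loglog_slope(alpha, x1, x2):
    """Slope of log P(X = x) against log x between two sizes x1 and x2."""
    y1 = math.log(zipf_pmf(x1, alpha))
    y2 = math.log(zipf_pmf(x2, alpha))
    return (y2 - y1) / (math.log(x2) - math.log(x1))

# The slope is the same over any pair of sizes, which is the straight line.
print(loglog_slope(2.5, 1, 10))
print(loglog_slope(2.5, 10, 1000))
```

Because the normalizing constant cancels in the difference of logarithms, the slope equals -alpha (up to floating-point error) whichever two sizes are compared.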

    Classifier selection with permutation tests

    This work presents a content-based recommender system for machine learning classifier algorithms. Given a new data set, the system recommends the classifier likely to perform best, based on classifier performance over similar known data sets. Similarity is measured according to a data set characterization that includes several state-of-the-art metrics covering physical structure, statistics, and information theory. A novelty with respect to prior work is the use of a robust approach based on permutation tests to assess directly whether a given learning algorithm is able to exploit the attributes in a data set to predict class labels; this approach is compared with the more commonly used F-score metric for evaluating classifier performance. To evaluate our approach, we conducted extensive experiments with 8 of the main machine learning classification methods in varying configurations and 65 binary data sets, leading to over 2331 experiments. Our results show that using the information from the permutation test clearly improves the quality of the recommendations.
    Peer Reviewed. Postprint (author's final draft)
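
The idea of a label permutation test can be sketched as follows. This is a generic illustration, not the authors' implementation. The toy leave-one-out 1-nearest-neighbour score, the function names and the data are all assumptions chosen to keep the example self-contained.

```python
import random

def loo_1nn_accuracy(xs, ys):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on 1-D data."""
    hits = 0
    for i, x in enumerate(xs):
        # The nearest other point decides the predicted label.
        j = min((k for k in range(len(xs)) if k != i),
                key=lambda k: abs(xs[k] - x))
        hits += ys[j] == ys[i]
    return hits / len(xs)

def permutation_p_value(xs, ys, score=loo_1nn_accuracy, n_perm=200, seed=0):
    """Share of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score(xs, ys)
    shuffled = list(ys)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between attributes and labels
        if score(xs, shuffled) >= observed:
            at_least += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (at_least + 1) / (n_perm + 1)

# Two well-separated clusters: the attribute clearly predicts the label.
xs = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
ys = [0] * 10 + [1] * 10
print(permutation_p_value(xs, ys))
```

A small p-value suggests the classifier exploits real structure in the attributes, since shuffled labels rarely allow an equally good score.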

    The Zipf-Polylog distribution: Modeling human interactions through social networks

    The Zipf distribution attracts considerable attention because it helps describe data from natural as well as man-made systems. In most cases, however, the Zipf is only appropriate for fitting data in the upper tail. This is why it is important to have Zipf extensions that can fit the data over its entire range. In this paper, we introduce the Zipf-Polylog family of distributions as a two-parameter generalization of the Zipf. The extended family contains the Zipf, the geometric, the logarithmic series and the shifted negative binomial with two successes as particular cases. We derive important properties of the new family and demonstrate its suitability by analyzing the degree sequences of two real networks over their entire range.
    Peer Reviewed. Postprint (author's final draft)
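
The particular cases listed in the abstract can be illustrated numerically. The sketch below assumes that the family has probability mass function P(X = x) = beta^x x^(-alpha) / Li_alpha(beta), where Li is the polylogarithm. The abstract does not state this form, so treat it as an assumption. The function name and the truncated-series approximation are likewise illustrative.

```python
import math

def zipf_polylog_pmf(x, alpha, beta, terms=2000):
    """Assumed pmf: beta**x * x**(-alpha) / Li_alpha(beta), x = 1, 2, ...

    Li_alpha(beta) is approximated by a truncated series. This sketch only
    handles 0 < beta < 1, where the series converges quickly.
    """
    if not 0 < beta < 1:
        raise ValueError("this sketch assumes 0 < beta < 1")
    polylog = sum(beta ** k * k ** -alpha for k in range(1, terms + 1))
    return beta ** x * x ** -alpha / polylog

beta, x = 0.5, 3
# alpha = 0: geometric on 1, 2, ...
print(zipf_polylog_pmf(x, 0, beta), (1 - beta) * beta ** (x - 1))
# alpha = 1: logarithmic series
print(zipf_polylog_pmf(x, 1, beta), beta ** x / (-x * math.log(1 - beta)))
# alpha = -1: shifted negative binomial with two successes
print(zipf_polylog_pmf(x, -1, beta), x * (1 - beta) ** 2 * beta ** (x - 1))
```

Under the same assumed form, the Zipf itself corresponds to beta = 1 with alpha > 1, a boundary case this truncated-series sketch does not cover.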

    Randomly stopped extreme Zipf extensions

    In this paper, we extend the Zipf distribution by means of the Randomly Stopped Extreme mechanism; we establish the conditions under which the maximum and minimum families of distributions intersect in the original family; and we demonstrate how to generate data from the extended family using any Zipf random number generator. We study in detail the particular cases of geometric and positive Poisson stopping distributions, showing that, in log-log scale, the extended models allow for top-concavity (top-convexity) while maintaining linearity in the tail. We show the suitability of the models presented by fitting the degree sequences of a collaboration network and a protein-protein interaction network. The proposed models not only give a good fit, but also allow interesting insights to be extracted about the data generation mechanism.
    Peer Reviewed. Postprint (author's final draft)
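
Generating data through a randomly stopped extreme can be sketched as taking the maximum (or minimum) of N independent Zipf draws, with N itself random. The code below is one reading of that mechanism and not the paper's algorithm. The truncated Zipf generator, the geometric stopping variable and all names are assumptions, and any other Zipf generator could be passed in instead.

```python
import itertools
import math
import random

def truncated_zipf_rng(alpha, kmax, rng):
    """Stand-in Zipf generator on 1..kmax (the truncation is an assumption)."""
    support = range(1, kmax + 1)
    cum = list(itertools.accumulate(k ** -alpha for k in support))
    return lambda: rng.choices(support, cum_weights=cum)[0]

def geometric_rng(p, rng):
    """Geometric stopping variable N on 1, 2, ... with 0 < p < 1."""
    # Inverse-transform sampling; 1 - random() lies in (0, 1], so log is safe.
    return lambda: 1 + int(math.log(1.0 - rng.random()) / math.log(1.0 - p))

def stopped_extreme(draw_zipf, draw_n, extreme=max):
    """One draw: the max (or min) of N i.i.d. Zipf values, with N drawn first."""
    n = draw_n()
    return extreme(draw_zipf() for _ in range(n))

rng = random.Random(42)
draw_zipf = truncated_zipf_rng(alpha=2.5, kmax=1000, rng=rng)
draw_n = geometric_rng(p=0.3, rng=rng)
maxima = [stopped_extreme(draw_zipf, draw_n, max) for _ in range(1000)]
minima = [stopped_extreme(draw_zipf, draw_n, min) for _ in range(1000)]
print(sum(maxima) / len(maxima), sum(minima) / len(minima))
```

Passing the generators in as callables keeps the stopping mechanism separate from the base distribution, in line with the abstract's remark that any Zipf random number generator can be used.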

    Proceedings of the “Think Tank Hackathon’’, Big Data Training School for Life Sciences Follow-up, Ljubljana 6th – 7th February 2018

    On 6 and 7 February 2018, a Think Tank took place in Ljubljana, Slovenia, as a follow-up to the “Big Data Training School for Life Sciences” held in Uppsala, Sweden, in September 2017. The focus was on identifying topics of interest and optimising the programme for a forthcoming “Advanced” Big Data Training School for Life Science, which we hope will again be supported by the COST Action CHARME (Harmonising standardisation strategies to increase efficiency and competitiveness of European life-science research - CA15110). The Think Tank aimed to go into the details of several topics that were, to a degree, covered by the former training school. The discussions also drew on the attendees' recent experience, in light of the knowledge gained at the first edition of the training school and its relevance to their current and upcoming work. The 2018 training school should strive for and further facilitate optimised applications of Big Data technologies in life sciences. The workshop was organised entirely by the attendees of the hackathon.
    Peer Reviewed. Postprint (published version)

    The CHARME "Advanced Big Data Training School for Life Sciences": an example of good practices for training on current bioinformatics challenges

    The CHARME “Advanced Big Data Training School for Life Sciences” took place on 3-7 September 2018 at the Campus Nord of the Technical University of Catalonia (UPC) in Barcelona (ES). The school was organised by the Data Management Group (DAMA) of the UPC in collaboration with EMBnet, as a follow-up to the first CHARME-EMBnet “Big Data Training School for Life Sciences”, held in Uppsala, Sweden, in September 2017. The learning objectives of the school were defined and agreed during the CHARME “Think Tank Hackathon” held in Ljubljana, Slovenia, in February 2018. This article explains in detail how the training school was organised, the contents covered, and the interactions and relationships that the school helped establish among the trainees, the trainers and the organisers.
    Peer Reviewed. Postprint (published version)

    Elección de modalidades de trabajo, presencial u home office, en trabajadores residentes en Paraguay durante la pandemia del covid-19, 2022

    The general objective is to determine the choice of work modality, face-to-face or home office, among workers residing in Paraguay during the COVID-19 pandemic, 2022. The study followed a quantitative, cross-sectional and descriptive approach. The population consisted of 7,353,038 inhabitants of Paraguay, from which a sample of 268 was calculated with a confidence level of 94.2%, a margin of error of 5.8%, and a degree of heterogeneity of 50%. A survey was used as the data collection technique, and the instrument was a questionnaire of 3 open and 32 closed questions based on five criteria and 35 indicators. The survey was administered via WhatsApp, by convenience sampling, in May 2022. Only people residing in Paraguay who had a job and participated voluntarily were included. The results, based on the most frequent responses, are as follows. For the face-to-face modality, the main advantage was greater ease in integrating new members (: 4.24) and the main disadvantage was a greater increase in the cost of office supplies (: 4.62). For the home office modality, the main advantage was greater use of technology (:4.61) and the main disadvantage was a greater lack of limits on working hours (:4.09). It is concluded that people prefer to work under the home office modality, but they do not rule out a mixed arrangement either. Companies should seek the best work methodology for their employees, especially those that employ the millennial generation, who no longer wish to return to full-time work.
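
The reported sample size can be reproduced under a common textbook calculation. The sketch below assumes Cochran's formula for a proportion with a finite population correction. The abstract does not say which formula the authors used, so this is an assumption, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence, margin, p=0.5):
    """Sample size for estimating a proportion (assumed: Cochran's formula).

    p is the expected proportion; p = 0.5 corresponds to the 50%
    "degree of heterogeneity" and gives the most conservative size.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction (negligible for a population this large).
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(7_353_038, 0.942, 0.058))  # 268
```

With the figures given in the abstract, this calculation returns 268, which matches the reported sample.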

    Efectos estacionales en los mercados de capitales de la Alianza del Pacífico

    This paper aims to test for the existence of seasonal effects (day of the week, month of the year, turn of the month, end of December, and superstition) in the capital markets of the Pacific Alliance during the period 2002-2014. To this end, it uses an econometric methodology based on traditional and non-parametric statistics. The study concludes that there is a day-of-the-week effect in the Chilean, Colombian and Peruvian markets, which in general behaves as the literature suggests, and a turn-of-the-month effect in the Mexican and Peruvian markets, which also behaves as the literature suggests. No month-of-the-year, end-of-December or superstition effect was detected.

    Anales del III Congreso Internacional de Vivienda y Ciudad "Debate en torno a la nueva agenda urbana"

    Conference proceedings. The III Congreso Internacional de Vivienda y Ciudad “Debates en torno a la Nueva Agenda Urbana” was a strongly committed effort to bring forward the central and urgent debates that strain the full exercise of the right to the city. To that end, the organising institutions (INVIHAB, Instituto de Investigación de Vivienda y Hábitat, and MGyDH, Maestría en Gestión y Desarrollo Habitacional) convened a space that took shape as a vigorous transdisciplinary debate. It brought together intellectuals of international standing, researchers, academics and public officials and, through an innovative methodology, linked academic voices with those of social and neighbourhood organisations in the Foro de las Organizaciones Sociales, which had its own space to give voice to those working on the challenges of guaranteeing the rights to housing and urban goods in our twenty-first-century cities.
